The following machine learning project focuses on…
1 Introduction
1.1 Overview and Motivation
1.1.1 Context and Background
The Swiss real estate market, characterized by its resilience and complexity, presents a significant opportunity for advanced analytical approaches to understand pricing dynamics. This project, undertaken as part of a Master’s degree in Machine Learning at the University of Lausanne, aims to harness the power of data science to predict real estate market prices in Switzerland. Utilizing contemporary machine learning techniques within this academic framework not only enhances the learning experience but also contributes to a practical understanding of real estate valuation.
As housing prices continue to fluctuate amid economic uncertainties, such as interest rate changes and demographic shifts, this investigation is not only timely but also of significant importance to potential investors, policymakers, and the academic community.
1.1.2 Aim Of The Investigation
The primary objective of this study is to predict Swiss real estate market prices using real-time data scraped from ImmoScout24, a prominent Swiss real estate website. This study addresses two central questions:
How can machine learning models use real-time data scraped from online real estate platforms to predict price trends in the Swiss real estate market?
How can machine learning models predict the sale prices of real estate properties in Switzerland based on current market data?
The relevance of this investigation is underpinned by the substantial financial implications of real estate investments and by the value of predictive insights for both buyers and sellers, in a market where real estate plays a pivotal role in financial stability and growth.
1.1.3 Description of the Data
The data for this study consists of a meticulously compiled dataset from ImmoScout24, featuring a wide array of variables related to property listings across Switzerland. Fields in the dataset include price, number of rooms, square meters, address, canton, property type, floor, and year of construction. These data points have been gathered through a robust scraping algorithm designed to collect a comprehensive snapshot of the current market. This dataset provides a granular view of the market, essential for training robust machine learning models.
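To make the structure of the raw listings concrete, here is a minimal sketch with synthetic rows in the scraped format (the exact field names and value formats are assumptions based on the variables listed above):

```r
# Illustrative rows in the raw scraped format (synthetic values, assumed field names)
properties <- data.frame(
  price           = c("1,250,000.-", "850,000.-"),
  number_of_rooms = c("4.5 rooms", "3 rooms"),
  square_meters   = c("120 m2", "95 m2"),
  address         = c("1844 Villeneuve VD", "1000 Lausanne"),
  canton          = c("Vaud", "Vaud"),
  stringsAsFactors = FALSE
)
# Every field arrives as text, so each needs parsing before modelling, e.g.:
as.numeric(gsub("[^0-9]", "", properties$price))  # 1250000 850000
```

This is why the preprocessing steps in Section 2 revolve around stripping units and punctuation before converting columns to numeric.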
1.1.4 Methodology
This project employs model-based machine learning techniques to quantify the impact of various factors on property prices in Switzerland. The methodology involves training predictive models to interpret the complex relationships within the data, providing a statistical basis for price prediction. This approach allows for an examination of both linear dependencies and more intricate interactions, such as how location and property type combine to influence pricing.
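As a minimal illustration of the model-based approach, a baseline linear model can relate price to a few of the predictors described above. The data frame here is synthetic and the coefficients are made up for the example; only the variable names follow the dataset description:

```r
# Minimal sketch of a model-based baseline (synthetic data, assumed variable names)
set.seed(1)
n <- 200
toy <- data.frame(
  square_meters   = runif(n, 40, 300),
  number_of_rooms = sample(seq(1.5, 7.5, by = 0.5), n, replace = TRUE),
  canton          = factor(sample(c("Vaud", "Zurich", "Ticino"), n, replace = TRUE))
)
# Simulated price: linear in size and rooms, with a canton effect plus noise
toy$price <- 4000 * toy$square_meters + 30000 * toy$number_of_rooms +
  ifelse(toy$canton == "Zurich", 150000, 0) + rnorm(n, sd = 50000)

fit <- lm(price ~ square_meters + number_of_rooms + canton, data = toy)
coef(fit)["square_meters"]  # estimated CHF per additional square meter
```

The same pattern (a formula interface over a cleaned data frame) carries over to the richer models used later, where interactions such as location x property type can be added.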
1.1.5 Structure of the Report
The report is structured as follows to provide a coherent narrative and logical flow of analysis:
Section 1: Introduction - Outlines the research context, objectives, and significance.
Section 2: Data - Details the sources, nature, and preprocessing of the data used.
Section 3: Exploratory Data Analysis (EDA) - Analyzes the data to uncover patterns and anomalies.
# Identify values causing the issue
problematic_values <- properties$number_of_rooms[is.na(as.numeric(properties$number_of_rooms))]
#> Warning: NAs introduced by coercion

# Remove non-numeric characters and convert price to numeric
properties$price <- as.numeric(gsub("[^0-9]", "", properties$price))

# Exclude rows with price < 20000
properties <- properties[properties$price >= 20000, ]

# Replace incomplete addresses
properties$address <- gsub("^\\W*[.,0-]\\W*", "", properties$address)

properties_filtered <- na.omit(properties)
properties_filtered$year_category <- substr(properties_filtered$year_category, 1, 9)
properties_filtered$year_category <- as.factor(properties_filtered$year_category)

# Remove "m²" from 'square_meters' and convert to numeric
properties_filtered$square_meters <- as.numeric(gsub("\\D", "", properties_filtered$square_meters))

# Count remaining NA observations in square_meters
print(sum(is.na(properties_filtered$square_meters)))
#> [1] 1056

# Remove the NAs
properties_filtered <- properties_filtered[!is.na(properties_filtered$square_meters), ]

# Capitalize canton names
properties_filtered$canton <- tools::toTitleCase(properties_filtered$canton)

# Preprocess the number_of_rooms column
properties_filtered$number_of_rooms <- gsub(" rooms", "", properties_filtered$number_of_rooms)
properties_filtered$number_of_rooms <- gsub(" room", "", properties_filtered$number_of_rooms)

# Remove rows with "m²" in the number_of_rooms column
properties_filtered <- properties_filtered[!grepl("m²", properties_filtered$number_of_rooms), ]
properties_filtered$number_of_rooms <- as.numeric(properties_filtered$number_of_rooms)

# Remove rows with more than 100 rooms
properties_filtered <- properties_filtered[properties_filtered$number_of_rooms <= 100, , drop = FALSE]

# Divide by 10 if the number of rooms is more than 27 (likely a missing decimal, e.g. 45 -> 4.5)
properties_filtered$number_of_rooms <- ifelse(properties_filtered$number_of_rooms > 27,
                                              properties_filtered$number_of_rooms / 10,
                                              properties_filtered$number_of_rooms)

# Show the first 100 rows of the cleaned dataset using reactable
reactable(head(properties_filtered, 100))
The dataset described is the “Official Index of Localities” (Répertoire officiel des localités) provided by the Swiss Federal Office of Topography (swisstopo). It contains comprehensive information about all localities in Switzerland and the Principality of Liechtenstein, including their names, postal codes, and perimeters.
This dataset is regularly updated on a monthly basis, incorporating changes reported by cantonal authorities and Swiss Post. It serves various purposes, including spatial analysis, integration with other geographic datasets, usage as a geolocated background in GIS (Geographic Information Systems) and CAD (Computer-Aided Design) systems, statistical analysis, and as a reference dataset for information systems.
Updates and release notes for the dataset are provided periodically, detailing changes and improvements made over time. The Swiss Federal Office of Topography manages and distributes this dataset as part of its responsibilities in collecting and providing official geospatial data for Switzerland.
2.1.3.1 Creating Variable zip_code and merging with AMTOVZ_CSV_LV95
Code
df <- properties_filtered

# The address column looks like '1844 Villeneuve VD' and contains the zip code.
# Extract it into a new column 'zip_code'; Swiss zip codes are 4 digits long.
df$zip_code <- as.numeric(gsub("\\D", "", df$address))

# If the extracted number has more than 4 digits, keep only the last 4
df$zip_code <- ifelse(df$zip_code > 9999, df$zip_code %% 10000, df$zip_code)
2.1.3.2 Using AMTOVZ_CSV_LV95 to get the city and canton from the zip code
Code
# Read the AMTOVZ_CSV_WGS84 csv
amto <- read.csv(file.path(here(), "data/AMTOVZ_CSV_WGS84.csv"), sep = ";")

# Keep 'Gemeindename' (community name), 'PLZ' (zip code), 'Kantonskürzel' (canton code),
# 'E' (longitude) and 'N' (latitude)
amto_df <- amto[, c('Gemeindename', 'PLZ', 'Kantonskürzel', 'E', 'N')]

# Rename the columns
colnames(amto_df) <- c('Community', 'zip_code', 'Canton_code', 'lon', 'lat')

# Remove duplicate zip codes
amto_df <- amto_df[!duplicated(amto_df$zip_code), ]

# Add the amto_df variables to df where the zip code matches
df <- merge(df, amto_df, by = "zip_code", all.x = TRUE)

# Check for NAs in Community
df[is.na(df$Community), ]
#> (output truncated: 77 rows where Community is NA; the unmatched or malformed zip codes
#> include 25, 26, 322, 1014, 1200, 1919, 2500, 3000, 4000, 5201, 6000, 6511, 6547,
#> 6602-6604, 6901-6911, 7133, 7135, 8000, 8238, 8423 and 9241)
We have 77 rows with NA in Community, where either:
the zip code was not found in the amto data frame, or
the zip code was incorrectly isolated from the address.
We removed these rows.
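The mis-isolated zip codes come from stripping all non-digits at once: an address containing several numbers (for instance a foreign postal code next to the Swiss one) collapses into one long number. As a sketch of an alternative (not the cleaning used above), one could extract only the first standalone 4-digit group; `extract_zip` is a hypothetical helper introduced for illustration:

```r
# Sketch: extract the first standalone 4-digit group instead of stripping all non-digits
extract_zip <- function(address) {
  m <- regmatches(address, regexpr("\\b[0-9]{4}\\b", address))
  if (length(m) == 1) as.numeric(m) else NA
}
extract_zip("1844 Villeneuve VD")                                  # 1844
extract_zip("Chemin des pralets, 74100 Etrembieres, 1200 Geneve")  # 1200, not 741001200
```

The `\b` word boundaries prevent a 5-digit foreign postal code from being matched, so only a genuine 4-digit Swiss zip code is picked up.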
Code
# Remove the rows with NA in Community
properties_filtered <- df[!is.na(df$Community), ]
reactable(head(properties_filtered, 100))
# Read csv
impots <- read.csv(file.path(here(), "data/estv_income_rates.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Remove 1st row
impots <- impots[-1, ]

# Remove 3rd column
impots <- impots[, -3]

# Label the column groups
impots[1, 4:8]   <- "Impôt sur le revenu"
impots[1, 9:13]  <- "Impôt sur la fortune"
impots[1, 14:16] <- "Impôt sur le bénéfice"
impots[1, 17:19] <- "Impôt sur le capital"

# Combine the content of the first 2 rows into the 2nd row
impots[2, ] <- apply(impots[1:2, ], 2, function(x) paste(ifelse(is.na(x[1]), x[2],
                     ifelse(is.na(x[2]), x[1], paste(x[1], x[2], sep = " ")))))

# Remove 1st row
impots <- impots[-1, ]

# Assign the text to the 1st row, 1st column
impots[1, 1] <- "Coefficient d'impôt en %"

# Replace column names with the content of the first row
colnames(impots) <- impots[1, ]
impots <- impots[-1, ]

# Check for missing values in impots
any_missing <- any(is.na(impots))
if (any_missing) {
  print("There are missing values in impots.")
} else {
  print("There are no missing values in impots.")
}
#> [1] "There are no missing values in impots."

# Replace row names with the content of the 3rd column
row.names(impots) <- impots[, 3]
impots <- impots[, -3]

# Remove 2nd column (to avoid a duplicate canton column)
impots <- impots[, -2]

# Remove the church tax columns
impots <- impots[, -c(4:6)]
impots <- impots[, -c(6:8)]
impots <- impots[, -8]
impots <- impots[, -10]

# Clean data and convert to numeric
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))

# Replace NA values with 0
cleaned_impots[is.na(cleaned_impots)] <- 0

rownames(cleaned_impots) <- rownames(impots)
#reactable(head(cleaned_impots, 100))
2.1.5 Commune Data
2.1.5.1 Cleaning
To do: add the source and a description of this dataset, and explain the preprocessing below.
We replace NAs in both the social-aid coverage rate (Taux d'aide sociale) and the political data (Conseil National). For the social-aid coverage rate, NAs were flagged with reason "Q" ("not indicated to protect confidentiality"); we replaced them with the average rate for Switzerland in 2019, which was 3.2%.
For the political data, NAs were flagged with reason "M" ("not indicated because the data was not important or applicable"); we therefore replaced them with 0.
Code
# The path may need changing
commune_prep <- read.csv(file.path(here(), "data/commune_data.csv"), sep = ";", header = TRUE, stringsAsFactors = FALSE)

# Keep only 2019 as the reference year (2020 is apparently not complete)
commune_2019 <- subset(commune_prep, PERIOD_REF == "2019") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE", "STATUS"))

# Remove rows where STATUS is Q or M (no value), then drop the column
commune_2019 <- subset(commune_2019, STATUS == "A") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE"))

# Remove rows that are aggregates
commune_2019 <- subset(commune_2019, REGION != "Schweiz")

commune_2019 <- commune_2019 %>%
  pivot_wider(names_from = INDICATORS, values_from = VALUE)

# Rename columns using the provided map
commune <- commune_2019 %>%
  rename(`Population - Habitants` = Ind_01_01,
         `Population - Densité de la population` = Ind_01_03,
         `Population - Etrangers` = Ind_01_08,
         `Population - Part du groupe d'âge 0-19 ans` = Ind_01_04,
         `Population - Part du groupe d'âge 20-64 ans` = Ind_01_05,
         `Population - Part du groupe d'âge 65+ ans` = Ind_01_06,
         `Population - Taux brut de nuptialité` = Ind_01_09,
         `Population - Taux brut de divortialité` = Ind_01_10,
         `Population - Taux brut de natalité` = Ind_01_11,
         `Population - Taux brut de mortalité` = Ind_01_12,
         `Population - Ménages privés` = Ind_01_13,
         `Population - Taille moyenne des ménages` = Ind_01_14,
         `Sécurité sociale - Taux d'aide sociale` = Ind_11_01,
         `Conseil national - PLR` = Ind_14_01,
         `Conseil national - PDC` = Ind_14_02,
         `Conseil national - PS` = Ind_14_03,
         `Conseil national - UDC` = Ind_14_04,
         `Conseil national - PEV/PCS` = Ind_14_05,
         `Conseil national - PVL` = Ind_14_06,
         `Conseil national - PBD` = Ind_14_07,
         `Conseil national - PST/Sol.` = Ind_14_08,
         `Conseil national - PES` = Ind_14_09,
         `Conseil national - Petits partis de droite` = Ind_14_10)

# If no one voted for a party, the value is NA -> replace it with 0
commune <- commune %>%
  mutate_at(vars(starts_with("Conseil national")), ~replace_na(., 0))

# Removing NAs from the social-aid coverage column:
# set them to the mean for Switzerland in 2019 (3.2%)
mean_taux_aide_social <- 3.2
commune <- commune %>%
  mutate(`Sécurité sociale - Taux d'aide sociale` = if_else(is.na(`Sécurité sociale - Taux d'aide sociale`),
                                                            mean_taux_aide_social,
                                                            `Sécurité sociale - Taux d'aide sociale`))

# Show the first 100 rows of commune using reactable
reactable(head(commune, 100))
3 Unsupervised learning
Clustering and/or dimension reduction
Trying to cluster the commune data in order to: 1. reduce dimensionality, and 2. identify similarities between municipalities.
To decide: do we build one cluster for the political data and another for the demographic data, or do we group everything together?
Code
set.seed(100)

# Clustering demographics
cols_commune_demographic <- select(commune, -c("REGION", "CODE_REGION",
  "Conseil national - PLR", "Conseil national - PDC", "Conseil national - PS",
  "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL",
  "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES",
  "Conseil national - Petits partis de droite"))

# Scale the columns: some are totals, some are percentages
cols_commune_demographic <- scale(cols_commune_demographic)

# Calculate the distance matrix
dist_matrix_demographic <- dist(cols_commune_demographic, method = "minkowski")

# Perform hierarchical clustering
hclust_model_demographic <- hclust(dist_matrix_demographic, method = "ward.D")

# Create the dendrogram
dend_demo <- as.dendrogram(hclust_model_demographic)
dend_demo <- color_branches(dend_demo, k = 5)  # 5 clusters, to keep the same scale for all our variables
plot(dend_demo, main = "Demographics - Hierarchical Clustering Dendrogram")
Code
# Clustering politics
cols_commune_politics <- select(commune, c("Conseil national - PLR",
  "Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC",
  "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD",
  "Conseil national - PST/Sol.", "Conseil national - PES",
  "Conseil national - Petits partis de droite"))

# Scale the columns: some are totals, some are percentages
cols_commune_politics <- scale(cols_commune_politics)

# Calculate the distance matrix
dist_matrix_politics <- dist(cols_commune_politics, method = "minkowski")

# Perform hierarchical clustering
hclust_model_politics <- hclust(dist_matrix_politics, method = "ward.D")

# Create the dendrogram
dend_pol <- as.dendrogram(hclust_model_politics)
dend_pol <- color_branches(dend_pol, k = 5)  # 5 clusters, to keep the same scale for all our variables
plot(dend_pol, main = "Politics - Hierarchical Clustering Dendrogram")
To avoid introducing 10 new tax variables, we ran a clustering analysis on the tax dataset to identify which municipalities can be grouped together. Based on the within-cluster sum of squares, we settled on 5 clusters. These 5 distinct clusters are then assigned to properties to indicate which tax group each municipality belongs to.
Tax
Code
# Clean data and convert to numeric
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))
cleaned_impots[is.na(cleaned_impots)] <- 0  # Replace NA values with 0

# Scale the features
scaled_impots <- scale(cleaned_impots)

# Check the within-cluster sum of squares (elbow method)
wss <- numeric(10)
for (i in 1:10) {
  kmeans_model <- kmeans(scaled_impots, centers = i)
  wss[i] <- sum(kmeans_model$withinss)
}
#plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# Choose k based on the elbow method
k <- 5

# Perform k-means clustering with the chosen k
kmeans_model <- kmeans(scaled_impots, centers = k)

# Cluster labels
clusters <- kmeans_model$cluster
# Preparing df_commune for merging with the main dataset
df_commune <- select(commune, REGION)
df_commune$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)
df_commune$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Preparing to merge
merging <- inner_join(amto_df, df_commune, by = c("Community" = "REGION"))
impots_cluster_subset <- impots_cluster[, c("Community", "cluster")]
merging <- merging %>% left_join(impots_cluster_subset, by = "Community")

clusters_df <- merging %>%
  rename(Tax_cluster = cluster) %>%
  rename(Commune = Community)
clusters_df <- clusters_df %>%
  select(c("Commune", "zip_code", "Canton_code", "Demographic_cluster", "Political_cluster", "Tax_cluster"))

# The only NAs are for the commune Brugg (written "Brugg (AG)" in the other dataset) -> set its cluster manually
clusters_df$Tax_cluster[is.na(clusters_df$Tax_cluster)] <- 2

# Adding the clusters to our main dataset
properties_filtered <- merge(properties_filtered,
                             clusters_df[, c("zip_code", "Demographic_cluster", "Political_cluster", "Tax_cluster")],
                             by = "zip_code", all.x = TRUE)
Code
# Dropping 228 rows containing NAs after the merge
# Find rows with NA values in the cluster columns
na_rows <- subset(properties_filtered,
                  is.na(Demographic_cluster) | is.na(Political_cluster) | is.na(Tax_cluster))

# Drop the NA rows
properties_filtered <- anti_join(properties_filtered, na_rows, by = "zip_code")
Code
# Interpretation of demographic clusters
demographic_vars <- select(commune, -c("REGION", "CODE_REGION",
  "Conseil national - PLR", "Conseil national - PDC", "Conseil national - PS",
  "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL",
  "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES",
  "Conseil national - Petits partis de droite"))

# Scale the variables and convert to a data frame
scaled_demographic_vars <- scale(demographic_vars)
scaled_demographic_vars <- as.data.frame(scaled_demographic_vars)

# Add demographic cluster labels
scaled_demographic_vars$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)

# Melt the dataset to long format
melted_demographic <- melt(scaled_demographic_vars, id.vars = "Demographic_cluster")

# Create boxplots for each variable
for (variable in unique(melted_demographic$variable)) {
  # 5%/95% quantiles for each cluster, used to set the y-axis limits
  quantiles <- tapply(melted_demographic$value[melted_demographic$variable == variable],
                      melted_demographic$Demographic_cluster[melted_demographic$variable == variable],
                      quantile, c(0.05, 0.95))
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  boxplot(value ~ Demographic_cluster,
          data = melted_demographic[melted_demographic$variable == variable, ],
          main = paste("Boxplot of", variable, "by Demographic Cluster"),
          xlab = "Demographic Cluster",
          ylab = variable,
          ylim = c(ylim_min, ylim_max))
}
Code
# Subset the dataset to the variables used to create the political clusters
political_vars <- select(commune, c("Conseil national - PLR", "Conseil national - PDC",
                                    "Conseil national - PS", "Conseil national - UDC",
                                    "Conseil national - PEV/PCS", "Conseil national - PVL",
                                    "Conseil national - PBD", "Conseil national - PST/Sol.",
                                    "Conseil national - PES",
                                    "Conseil national - Petits partis de droite"))

# Scale the variables
scaled_political_vars <- scale(political_vars)

# Convert to data frame
scaled_political_vars <- as.data.frame(scaled_political_vars)

# Add political cluster labels
scaled_political_vars$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Melt the dataset to long format
melted_political <- melt(scaled_political_vars, id.vars = "Political_cluster")

# Create boxplots for each variable
for (variable in unique(melted_political$variable)) {
  # 5th and 95th percentiles per cluster, used to set the y-axis limits
  quantiles <- tapply(melted_political$value[melted_political$variable == variable],
                      melted_political$Political_cluster[melted_political$variable == variable],
                      quantile, c(0.05, 0.95))
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  boxplot(value ~ Political_cluster,
          data = melted_political[melted_political$variable == variable, ],
          main = paste("Boxplot of", variable, "by Political Cluster"),
          xlab = "Political Cluster",
          ylab = variable,
          ylim = c(ylim_min, ylim_max))
}
Code
# Subset the dataset to the variables used to create the tax clusters
tax_vars <- select(impots_cluster, -c("Community", "cluster"))

# Scale the variables
scaled_tax_vars <- scale(tax_vars)

# Convert to data frame
scaled_tax_vars <- as.data.frame(scaled_tax_vars)

# Add tax cluster labels
scaled_tax_vars$Tax_cluster <- impots_cluster$cluster

# Melt the dataset to long format
melted_tax <- melt(scaled_tax_vars, id.vars = "Tax_cluster")

# Create boxplots for each variable
for (variable in unique(melted_tax$variable)) {
  # 5th and 95th percentiles per cluster, used to set the y-axis limits
  quantiles <- tapply(melted_tax$value[melted_tax$variable == variable],
                      melted_tax$Tax_cluster[melted_tax$variable == variable],
                      quantile, c(0.05, 0.95))
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  boxplot(value ~ Tax_cluster,
          data = melted_tax[melted_tax$variable == variable, ],
          main = paste("Boxplot of", variable, "by Tax Cluster"),
          xlab = "Tax Cluster",
          ylab = variable,
          ylim = c(ylim_min, ylim_max))
}
4 Supervised learning
Data splitting (if a training/test set split is enough for the global analysis, at least one CV or bootstrap must be used)
Two or more models
Two or more scores
Tuning of one or more hyperparameters per model
Interpretation of the model(s)
4.1 Model 1
This section provides a comprehensive outline of the linear regression model and analysis methods employed in the study of property price determinants.
4.1.1 Tools and Software
The study was conducted using R, a programming language and environment widely recognized for its robust capabilities in statistical computing and graphics. Key packages used include:
corrplot for visualization of correlation matrices, aiding in preliminary feature selection.
car for diagnostic testing and variance inflation factor (VIF) analysis to detect multicollinearity among predictors.
caret for creating training and testing sets, and managing cross-validation processes.
ggplot2 and plotly for creating visual representations of model diagnostics, predictions, and residuals.
gtsummary for creating publication-ready tables summarizing regression analysis results.
Each of these tools was chosen for its specific functionality that aids in robust data analysis, ensuring that each step of the model building and evaluation process is well-supported.
Initially, a correlation analysis was conducted to identify and visualize linear relationships between the property prices and potential predictive variables such as the number of rooms and square meters. The correlation matrix was computed and plotted using the corrplot package:
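The chunk that produced the matrix is not echoed above; a minimal sketch of how it might look, assuming the corrplot package and the numeric columns named in the text:

```r
# Correlation matrix of price and the numeric predictors (sketch; the exact
# column selection in the original analysis is not shown)
num_vars <- properties_filtered[, c("price", "number_of_rooms", "square_meters")]
cor_matrix <- cor(num_vars, use = "complete.obs")  # ignore rows with NAs
corrplot::corrplot(cor_matrix, method = "number")  # print coefficients in each cell
```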
We can observe that the correlation between the number of rooms and price (0.02) is weak, while the correlation between square meters and price (0.68) is strong. The number of rooms and square meters are themselves only weakly correlated (0.12), indicating that they contribute largely independent information to the model.
Question: how can correlations with categorical variables be computed?
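One common option, sketched here as an illustration (not part of the original analysis), is to one-hot encode the factor with model.matrix() and correlate the resulting dummy columns with price:

```r
# Sketch: correlate a categorical predictor with price via dummy encoding
# (property_type is assumed to be a factor in properties_filtered)
dummies <- model.matrix(~ property_type - 1, data = properties_filtered)
cor(dummies, properties_filtered$price)  # one correlation per level
```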
Question: is VIF analysis really necessary and meaningful here?
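For reference, the VIF check mentioned in 4.1.1 would look like this with the car package (a sketch, assuming the fitted lm_model object):

```r
# Variance inflation factors; values above roughly 5-10 usually flag
# multicollinearity. For models with factor terms, car reports
# generalized VIFs (GVIF) instead.
car::vif(lm_model)
```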
4.1.2.2 Model Building
The dataset was split into training and testing sets to validate the model's performance. The linear regression model was then fitted using selected predictors:
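The split and fit are not echoed in the report; a minimal sketch, under the assumption that caret's createDataPartition was used with an 80/20 split and the predictors listed in the results table:

```r
library(caret)

set.seed(123)  # reproducible split (the seed value is an assumption)
train_idx <- createDataPartition(properties_filtered$price, p = 0.8, list = FALSE)
trainData <- properties_filtered[train_idx, ]
testData  <- properties_filtered[-train_idx, ]

# Fit the linear model on the predictors that appear in the regression table
lm_model <- lm(price ~ number_of_rooms + square_meters + property_type +
                 floor + year_category + Demographic_cluster +
                 Political_cluster + Tax_cluster,
               data = trainData)
```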
Diagnostic checks such as residual analysis and normality tests were conducted to validate model assumptions. Performance metrics including R-squared and RMSE were calculated to assess the model’s explanatory power and prediction accuracy.
Code
sum(is.na(testData$price))  # Check NAs in price
#> [1] 0
sapply(testData, function(x) sum(is.na(x)))  # Check NAs in all predictors
#>            zip_code               price     number_of_rooms
#>                   0                   0                   0
#>       square_meters             address              canton
#>                   0                   0                   0
#>       property_type               floor       year_category
#>                   0                   0                   0
#>           Community         Canton_code                 lon
#>                   0                   0                   0
#>                 lat Demographic_cluster   Political_cluster
#>                   0                   0                   0
#>         Tax_cluster
#>                   0
Question: do we need the plots, or can we delete them?
Code
# Model Evaluation
## Diagnostic Checks
# plot(lm_model)
# ad.test(residuals(lm_model))

# Use gtsummary to show the result
tbl_reg_1 <- gtsummary::tbl_regression(lm_model)
tbl_reg_1
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| number_of_rooms | 16,483 | 4,030, 28,937 | 0.009 |
| square_meters | 8,621 | 8,413, 8,829 | <0.001 |
| property_type |  |  |  |
| Apartment | — | — |  |
| Attic flat | 117,222 | 35,764, 198,679 | 0.005 |
| Bifamiliar house | -214,970 | -292,246, -137,693 | <0.001 |
| Chalet | 189,879 | 91,980, 287,779 | <0.001 |
| Duplex | -119,296 | -218,381, -20,212 | 0.018 |
| Farm house | -510,786 | -735,669, -285,903 | <0.001 |
| Loft | -158,775 | -672,814, 355,263 | 0.5 |
| Roof flat | -86,370 | -201,502, 28,761 | 0.14 |
| Rustic house | -18,453 | -473,759, 436,854 | >0.9 |
| Single house | -141,364 | -187,034, -95,694 | <0.001 |
| Terrace flat | -21,620 | -185,933, 142,693 | 0.8 |
| Villa | 184,138 | 111,719, 256,557 | <0.001 |
| floor |  |  |  |
| floor | — | — |  |
| eg | 19,216 | -27,545, 65,978 | 0.4 |
| noteg |  |  |  |
| year_category |  |  |  |
| 0-1919 | — | — |  |
| 1919-1945 | 225,546 | 118,201, 332,890 | <0.001 |
| 1946-1960 | 288,880 | 187,923, 389,837 | <0.001 |
| 1961-1970 | 224,755 | 139,248, 310,263 | <0.001 |
| 1971-1980 | 316,943 | 240,672, 393,213 | <0.001 |
| 1981-1990 | 275,729 | 198,711, 352,747 | <0.001 |
| 1991-2000 | 330,105 | 250,095, 410,115 | <0.001 |
| 2001-2005 | 472,269 | 375,585, 568,953 | <0.001 |
| 2006-2010 | 533,813 | 448,724, 618,902 | <0.001 |
| 2011-2015 | 578,987 | 496,243, 661,732 | <0.001 |
| 2016-2024 | 421,772 | 356,791, 486,752 | <0.001 |
| Demographic_cluster | 8,871 | -3,779, 21,520 | 0.2 |
| Political_cluster | -47,861 | -60,220, -35,503 | <0.001 |
| Tax_cluster | 31,090 | 20,287, 41,893 | <0.001 |

¹ CI = Confidence Interval
4.1.2.3.1 Metrics
Here are the performance metrics for the initial model:
Code
# Print R-squared, adjusted R-squared, RMSE, MAE and AIC
r_sq <- summary(lm_model)$r.squared
adj_r_sq <- summary(lm_model)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model, newdata = testData))
mae <- mae(testData$price, predict(lm_model, newdata = testData))
aic <- AIC(lm_model)

# Show those metrics in an HTML table
metrics_1 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic),
                   format = "html",
                   col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Basic Model Performance Metrics" = 5))
metrics_1
Basic Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.473 | 0.472 | 965171 | 493713 | 501375 |
4.1.2.4 Variable Selection
Stepwise regression was performed to refine the model and improve its predictive performance. The resulting model was evaluated using the same diagnostic checks and performance metrics as the initial model:
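The stepwise call itself is not shown in the report; a sketch of how AIC-based stepwise selection is typically done in base R, assuming the fitted lm_model as a starting point:

```r
# AIC-based stepwise selection in both directions, starting from the full
# model (sketch; the original call and search scope are not shown)
lm_model2 <- step(lm_model, direction = "both", trace = 0)
summary(lm_model2)
```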
Here we compare the performance metrics of the initial model and the stepwise model:
Code
# Print R-squared, adjusted R-squared, RMSE, MAE and AIC
r_sq <- summary(lm_model2)$r.squared
adj_r_sq <- summary(lm_model2)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model2, newdata = testData))
mae <- mae(testData$price, predict(lm_model2, newdata = testData))
aic <- AIC(lm_model2)

# Show those metrics in an HTML table
metrics_2 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic),
                   format = "html",
                   col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Stepwise Model Performance Metrics" = 5))
metrics_2
Stepwise Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.473 | 0.472 | 965256 | 493985 | 501373 |
Code
metrics_1
Basic Model Performance Metrics

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.473 | 0.472 | 965171 | 493713 | 501375 |
4.1.2.5 Cross-Validation
Cross-validation was used to assess the model's robustness, guarding against overfitting and ensuring that the model generalizes well to new data; the caret package was used to manage this process efficiently. The CV results report metrics such as RMSE and R-squared across folds, which indicate the stability of the model's performance across different subsets of the data.
To do: write the comparison with the stepwise model; the stability of the metrics across folds suggests the model is robust.
4.1.2.6 Model testing
The final model was tested using the unseen test dataset to evaluate its predictive accuracy. Residual plots and actual vs. predicted price plots were created to visually assess model performance:
4.1.2.6.1 Residual vs Predicted Prices
This plot shows residuals (differences between observed and predicted prices) against predicted prices. Ideally, residuals should randomly scatter around the horizontal line at zero, indicating that the model doesn’t systematically overestimate or underestimate prices.
Code
# Model Testing
## Test the model
predicted_prices <- predict(lm_model2, newdata = testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment

# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p
As we can observe, the spread of residuals suggests potential issues with model fit, particularly for higher price predictions where the variance seems to increase.
4.1.2.6.2 Actual vs Predicted Prices
This plot should ideally show points along the diagonal line, where actual prices equal predicted prices. Clustering along the line suggests a generally good model fit, but the visible spread indicates variance in the predictions, especially at higher price points.
Code
## Visualization
gg <- ggplot(data = testData, aes(x = predicted_prices, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Actual vs Predicted Prices", x = "Predicted Prices", y = "Actual Prices")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p
4.1.3 Linear Regression between 10th and 90th percentile
To mitigate the increased variance at higher price points, we can restrict the data to a narrower price range. Here, we keep only prices between the 10th and 90th percentiles of the price distribution and check whether the model performs better within this range:
Code
# Filter properties_filtered between the 10th and 90th percentiles of price
properties_filtered <- properties_filtered %>%
  filter(price >= quantile(price, 0.1) & price <= quantile(price, 0.9))
# Model Evaluation
## Diagnostic Checks
# plot(lm_model)
# ad.test(residuals(lm_model))

# Use gtsummary to show the result
tbl_reg_1_10_90 <- gtsummary::tbl_regression(lm_model_10_90)
tbl_reg_1_vs_10_90 <- tbl_merge(
  tbls = list(tbl_reg_1, tbl_reg_1_10_90),
  tab_spanner = c("**Basic Model (All Prices)**", "**Basic Model (10-90 Qt)**")
)
tbl_reg_1_vs_10_90
| Characteristic | Beta (All Prices) | 95% CI¹ (All Prices) | p-value (All Prices) | Beta (10-90 Qt) | 95% CI¹ (10-90 Qt) | p-value (10-90 Qt) |
|---|---|---|---|---|---|---|
| number_of_rooms | 16,483 | 4,030, 28,937 | 0.009 | 10,256 | 3,861, 16,650 | 0.002 |
| square_meters | 8,621 | 8,413, 8,829 | <0.001 | 3,275 | 3,137, 3,414 | <0.001 |
| property_type |  |  |  |  |  |  |
| Apartment | — | — |  | — | — |  |
| Attic flat | 117,222 | 35,764, 198,679 | 0.005 | 114,530 | 79,830, 149,230 | <0.001 |
| Bifamiliar house | -214,970 | -292,246, -137,693 | <0.001 | 115,447 | 83,066, 147,828 | <0.001 |
| Chalet | 189,879 | 91,980, 287,779 | <0.001 | 103,423 | 57,177, 149,668 | <0.001 |
| Duplex | -119,296 | -218,381, -20,212 | 0.018 | 7,197 | -34,184, 48,577 | 0.7 |
| Farm house | -510,786 | -735,669, -285,903 | <0.001 | 140,236 | 41,807, 238,664 | 0.005 |
| Loft | -158,775 | -672,814, 355,263 | 0.5 | -71,140 | -267,068, 124,789 | 0.5 |
| Roof flat | -86,370 | -201,502, 28,761 | 0.14 | 3,948 | -45,560, 53,456 | 0.9 |
| Rustic house | -18,453 | -473,759, 436,854 | >0.9 | 5,551 | -432,340, 443,443 | >0.9 |
| Single house | -141,364 | -187,034, -95,694 | <0.001 | 75,061 | 54,455, 95,667 | <0.001 |
| Terrace flat | -21,620 | -185,933, 142,693 | 0.8 | 106,509 | 40,242, 172,777 | 0.002 |
| Villa | 184,138 | 111,719, 256,557 | <0.001 | 137,452 | 103,836, 171,068 | <0.001 |
| floor |  |  |  |  |  |  |
| floor | — | — |  | — | — |  |
| eg | 19,216 | -27,545, 65,978 | 0.4 | 25,118 | 5,014, 45,222 | 0.014 |
| noteg |  |  |  |  |  |  |
| year_category |  |  |  |  |  |  |
| 0-1919 | — | — |  | — | — |  |
| 1919-1945 | 225,546 | 118,201, 332,890 | <0.001 | 43,341 | -6,273, 92,954 | 0.087 |
| 1946-1960 | 288,880 | 187,923, 389,837 | <0.001 | 65,441 | 18,330, 112,551 | 0.006 |
| 1961-1970 | 224,755 | 139,248, 310,263 | <0.001 | 161,702 | 120,957, 202,447 | <0.001 |
| 1971-1980 | 316,943 | 240,672, 393,213 | <0.001 | 160,274 | 124,123, 196,425 | <0.001 |
| 1981-1990 | 275,729 | 198,711, 352,747 | <0.001 | 160,031 | 124,110, 195,952 | <0.001 |
| 1991-2000 | 330,105 | 250,095, 410,115 | <0.001 | 154,502 | 117,626, 191,377 | <0.001 |
| 2001-2005 | 472,269 | 375,585, 568,953 | <0.001 | 301,291 | 256,964, 345,619 | <0.001 |
| 2006-2010 | 533,813 | 448,724, 618,902 | <0.001 | 335,584 | 296,532, 374,635 | <0.001 |
| 2011-2015 | 578,987 | 496,243, 661,732 | <0.001 | 322,095 | 284,105, 360,086 | <0.001 |
| 2016-2024 | 421,772 | 356,791, 486,752 | <0.001 | 219,761 | 189,129, 250,392 | <0.001 |
| Demographic_cluster | 8,871 | -3,779, 21,520 | 0.2 |  |  |  |
| Political_cluster | -47,861 | -60,220, -35,503 | <0.001 | -29,201 | -34,495, -23,907 | <0.001 |
| Tax_cluster | 31,090 | 20,287, 41,893 | <0.001 | -572 | -5,230, 4,087 | 0.8 |

¹ CI = Confidence Interval
4.1.3.2.1 Metrics
Here are the performance metrics for the model with prices between the 10th and 90th percentiles:
Code
# Print R-squared, adjusted R-squared, RMSE, MAE and AIC
r_sq <- summary(lm_model_10_90)$r.squared
adj_r_sq <- summary(lm_model_10_90)$adj.r.squared
# Note: the original call omitted newdata = testData, which triggered a
# length-recycling warning; newdata is supplied here so RMSE is computed
# on the test set
rmse <- rmse(testData$price, predict(lm_model_10_90, newdata = testData))
mae <- mae(testData$price, predict(lm_model_10_90, newdata = testData))
aic <- AIC(lm_model_10_90)

# Show those metrics in an HTML table
metrics_1_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic),
                         format = "html",
                         col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Basic Model Performance Metrics (10-90 Qt)" = 5))
metrics_1_10_90
Basic Model Performance Metrics (10-90 Qt)

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.315 | 0.314 | 529342 | 296019 | 378363 |
Here are the performance metrics for the stepwise model with prices between the 10th and 90th percentiles, together with the comparison against the initial model:
Code
## Performance Metrics
r_sq <- summary(lm_model2_10_90)$r.squared
adj_r_sq <- summary(lm_model2_10_90)$adj.r.squared
# Note: as above, the original call omitted newdata = testData, triggering a
# length-recycling warning; newdata is supplied here
rmse <- rmse(testData$price, predict(lm_model2_10_90, newdata = testData))
mae <- mae(testData$price, predict(lm_model2_10_90, newdata = testData))
aic <- AIC(lm_model2_10_90)

# Show those metrics in an HTML table
metrics_2_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic),
                         format = "html",
                         col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center",
                bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Stepwise Model Performance Metrics (10-90 Qt)" = 5))
metrics_2_10_90
Stepwise Model Performance Metrics (10-90 Qt)

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.315 | 0.314 | 529357 | 296009 | 378361 |
For comparison, here are the metrics of our basic model (without stepwise selection):
Code
metrics_1_10_90
Basic Model Performance Metrics (10-90 Qt)

| R-Squared | Adjusted R-Squared | RMSE | MAE | AIC |
|---|---|---|---|---|
| 0.315 | 0.314 | 529342 | 296019 | 378363 |
For comparison, here are the 10-fold cross-validation metrics of our first basic model (without the 10-90 Qt filter):
Code
metrics_cv_1
10 Fold Cross Validation Metrics

| intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| TRUE | 987246 | 0.476 | 506542 | 96757 | 0.061 | 24562 |
4.1.3.5 Model testing
4.1.3.5.1 Residual vs Predicted Prices
Code
# Model Testing
## Test the model
predicted_prices <- predict(lm_model2_10_90, newdata = testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment

# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p
4.1.3.5.2 Actual vs Predicted Prices
Code
## Visualization
gg <- ggplot(data = testData, aes(x = predicted_prices, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Actual vs Predicted Prices", x = "Predicted Prices", y = "Actual Prices")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p
5 EDA
5.1 Map representation of distribution of properties
Here we decided to represent the distribution of properties in Switzerland using a map. The map is interactive and allows users to hover over the markers to see the price. The markers are color-coded in orange and have a semi-transparent fill to reduce visual noise. The size of the markers is smaller to optimize the visual representation of the data.
This visualization helps in understanding the geographical spread and density of properties in the dataset.
Code
# Create a leaflet map with optimized markers
map <- leaflet(properties_filtered) %>%
  addTiles() %>%                                       # Add default OpenStreetMap tiles
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>%  # Add topographic maps for context
  addCircleMarkers(~lon, ~lat,
                   radius = 1.5,       # Smaller radius for the circle markers
                   color = "#F97300",  # Colour for the markers
                   fillOpacity = 0.2,  # Semi-transparent fill
                   stroke = FALSE,     # No border, to reduce visual noise
                   popup = ~paste("Price: ", price, "<br>",
                                  "Rooms: ", number_of_rooms, "<br>",
                                  "Type: ", property_type, "<br>",
                                  "Year: ", year_category),
                   label = ~paste("Price: ", price)) %>%  # Tooltip on hover
  addLegend(position = "bottomright",  # Legend at the bottom right
            colors = "#F97300",        # Same colour as the markers
            labels = "Properties")     # Label for the legend

map$width <- "100%"  # Set the width of the map to 100%
map$height <- 600    # Set the height of the map to 600 pixels
map
The map highlights that properties are well-distributed throughout the region, with fewer properties in the Alpine region, which is expected due to its mountainous terrain. We thus have a good representation of the data across different cantons and locations and we can use it for further analysis.
5.2 Histogram of prices
Code
histogram_price <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "red") +
  labs(title = "Distribution of Prices", x = "Price", y = "Frequency") +
  theme_minimal()

# Convert ggplot object to plotly object
interactive_histogram_price <- ggplotly(histogram_price, width = 600, height = 400)

# Display the interactive histogram
interactive_histogram_price
5.3 Prices between 0 and 5 million
5.3.1 Histogram of prices for each property type
Code
# Create the ggplot object
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ property_type, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Property Type", x = "Price", y = "Frequency") +
  theme_minimal() +
  xlim(0, 5000000)

# Convert ggplot object to plotly object
interactive_histogram <- ggplotly(histogram, width = 600, height = 1000)

# Display the interactive plot
interactive_histogram
5.4 Histogram of prices for each year category
Note: only prices between 0 and 5 million are shown, so some outliers are excluded.
Code
# Create a histogram of prices for each year category
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ year_category, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Year Category", x = "Price", y = "Frequency") +
  theme_minimal() +
  xlim(0, 5000000)

# Convert ggplot object to plotly object
interactive_histogram_year <- ggplotly(histogram, width = 600, height = 1000)

# Display the interactive plot
interactive_histogram_year
5.5 Histogram of prices for each canton
Note: only prices between 0 and 5 million are shown, so some outliers are excluded.
Code
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ canton, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Canton for properties between 0 and 5 million",
       x = "Price", y = "Frequency") +
  theme(axis.text.y = element_text(size = 2)) +
  theme_minimal() +
  xlim(0, 5000000)

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 600, height = 1000)
# %>% layout(height = 1000)  # Adjust the height as needed

# Display the interactive plot
interactive_histogram
5.6 Histogram of prices for each number of rooms
Note: only prices between 0 and 5 million are shown, so some outliers are excluded.
Code
# Create a histogram of prices for each number of rooms
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 10000, fill = "skyblue", color = "black") +
  facet_wrap(~ number_of_rooms, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Number of Rooms", x = "Price", y = "Frequency") +
  theme_minimal() +
  xlim(0, 5000000)

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 600, height = 1000) %>%
  layout(height = 2000)

# Display the interactive plot
interactive_histogram
5.7 Histogram of properties by square meters
Code
histogram <- ggplot(properties_filtered, aes(x = square_meters)) +
  geom_histogram(binwidth = 15, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Properties by Square Meters",
       x = "Square Meters", y = "Frequency") +
  theme_minimal()
# + xlim(0, 2000)

# Convert ggplot object to plotly object (adjust width and height as needed)
interactive_histogram <- ggplotly(histogram, width = NULL, height = NULL)

# Display the interactive plot
interactive_histogram
5.8 Histogram of prices with tax (impôt) data
Code
# # Create the boxplot
# boxplot <- ggplot(properties_filtered, aes(x = as.factor(Tax_cluster), y = price)) +
#   geom_boxplot(fill = "skyblue", color = "black") +
#   labs(title = "Boxplot of Property Prices by Tax Cluster",
#        x = "Tax Cluster",
#        y = "Price") +
#   theme_minimal() +
#   ylim(100000, 400000)
#
# # Convert ggplot object to plotly object
# interactive_boxplot <- ggplotly(boxplot)
# interactive_boxplot
Code
impot_cols <- names(properties_filtered)[startsWith(names(properties_filtered), "Impôt")]

# Count the number of NA values in selected columns
na_counts <- colSums(is.na(properties_filtered[impot_cols]))

# Print the counts
print(na_counts)
#> numeric(0)